There are five columns that contain date information. These are:
These may be useful for later, but as they stand right now they’re not usable. Let’s make them datetime objects.
Write a function that takes in a dataframe and a column of choice, and adds columns to the dataframe that splits the datetime object. I want the year, the month (maybe cyclical?), the day, as well as the date as a decimal of the year.
split_to_columns <- function(df, input_colnum, output_colname_assoc_arr) {
dates <- strptime(df[, input_colnum], format="%Y-%m-%d %H:%M:%S")
df[output_colname_assoc_arr["year"]] <- as.numeric(strftime(dates, format="%Y"))
df[output_colname_assoc_arr["month"]] <- as.numeric(strftime(dates, format="%m"))
df[output_colname_assoc_arr["day"]] <- as.numeric(strftime(dates, format="%d"))
df[output_colname_assoc_arr["decimal_date"]] <- df[output_colname_assoc_arr["year"]] + (as.numeric(strftime(dates, format="%j"))/366)
return(df)
}
#column 3
listing_creation_date_set <- c()
listing_creation_date_set["year"] <- "ListingCreationDate_year"
listing_creation_date_set["month"] <- "ListingCreationDate_month"
listing_creation_date_set["day"] <- "ListingCreationDate_day"
listing_creation_date_set["decimal_date"] <- "ListingCreationDate_decimal"
#column 7
closed_date_set <- c()
closed_date_set["year"] <- "ClosedDate_year"
closed_date_set["month"] <- "ClosedDate_month"
closed_date_set["day"] <- "ClosedDate_day"
closed_date_set["decimal_date"] <- "ClosedDate_decimal"
#column 25
credit_pulled_set <- c()
credit_pulled_set["year"] <- "DateCreditPulled_year"
credit_pulled_set["month"] <- "DateCreditPulled_month"
credit_pulled_set["day"] <- "DateCreditPulled_day"
credit_pulled_set["decimal_date"] <- "DateCreditPulled_decimal"
#column 28
first_recorded_credit_set <- c()
first_recorded_credit_set["year"] <- "FirstRecordedCreditLine_year"
first_recorded_credit_set["month"] <- "FirstRecordedCreditLine_month"
first_recorded_credit_set["day"] <- "FirstRecordedCreditLine_day"
first_recorded_credit_set["decimal_date"] <- "FirstRecordedCreditLine_decimal"
#column 65
loan_orig_date_set <- c()
loan_orig_date_set["year"] <- "LoanOriginationDate_year"
loan_orig_date_set["month"] <- "LoanOriginationDate_month"
loan_orig_date_set["day"] <- "LoanOriginationDate_day"
loan_orig_date_set["decimal_date"] <- "LoanOriginationDate_decimal"
prosper <- split_to_columns(prosper, 3, listing_creation_date_set)
prosper <- split_to_columns(prosper, 7, closed_date_set)
prosper <- split_to_columns(prosper, 25, credit_pulled_set)
prosper <- split_to_columns(prosper, 28, first_recorded_credit_set)
prosper <- split_to_columns(prosper, 65, loan_orig_date_set)
Let’s make the credit score a little easier to work with by putting that average into our dataframe
prosper$CreditScoreMean = (prosper$CreditScoreRangeUpper + prosper$CreditScoreRangeLower)/2
First Plots: Histograms of potentially interesting things
Is the stated monthly income a continuous series of numbers or is it split into categories?
summary(prosper$StatedMonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
Looks to be continuous to me, between 0 and $1.75 million, but weighted heavily to < 10,000. This is an…odd distribution. I should be able to make a histogram of that too and see what the real distribution is.
This is the stated monthly income at the time of borrowing
What about the original loan amount?
Hmm…. Is there some correlation between loan amount and stated monthly income?
with(prosper, cor.test(LoanOriginalAmount, StatedMonthlyIncome))
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and StatedMonthlyIncome
## t = 69.353, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1956816 0.2068243
## sample estimates:
## cor
## 0.2012595
Correlation coefficient = 0.2012595. Quite lower than what I was expecting. Let’s see what the scatterplot says
I wonder if there’s any visible difference when coloring on Borrower APR instead of just term.
Somewhat. Smaller loans for those employed part time. A ton of borrowers are employed or self-employed though. Let’s see the employment statuses of borrowers more explicitly.
Let’s see them in their own element
This could be an interesting thing to look at. Is there a split on Term?
Oooh this is money right here. No pun intended.
Here’s another question then, is there a significant difference between the ways different employment statuses borrow over time?
The employed and full-time borrowers wash out everyone else, so let’s see what it looks like without them.
Curious about the Loan Principal vs the Stated Monthly Income
Hmm… what about vs the mean credit score
Onto other things. There’s a variable for the percent of the listing funded. I suspect there may be a correlation between the PercentFunded and the amount asked for.
with(prosper, cor.test(LoanOriginalAmount, PercentFunded))
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and PercentFunded
## t = -3.4594, df = 113940, p-value = 0.0005416
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.016053721 -0.004441884
## sample estimates:
## cor
## -0.01024815
-0.01024815 is basically no correlation at all. Disappointment.
Let’s see what that PercentFunded parameter looks like
Oh that’s why. Because the vaaaaast majority of loans are fully funded. That was useless and enlightening all at once.
Let’s quantify just how many are fully funded
sum(with(prosper, (PercentFunded == 1.0) & !is.na(PercentFunded))) / sum(with(prosper, !is.na(PercentFunded))) * 100
## [1] 99.23642
99.236%! Yeah, we’re not going to get any meaningful correlations out of this one.
Another column is LP_NetPrincipalLoss. It’s the principal that remains uncollected after any recoveries. Curious to see if there’s any relationship between that and the loan amount. Of course, we start with the cor.test function, then plot some stuff up to see the distribution.
with(prosper, cor.test(LP_NetPrincipalLoss, LoanOriginalAmount))
##
## Pearson's product-moment correlation
##
## data: LP_NetPrincipalLoss and LoanOriginalAmount
## t = 43.359, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1216916 0.1331162
## sample estimates:
## cor
## 0.1274081
A correlation coefficient of 0.1274. Not the worst I’ve seen thus far, but still not great. Let’s see the plot
Ok now this is actually somewhat interesting. It seems that when there is a non-zero loss, there might actually be some correlation. Let’s test that again with cor.test
with(subset(prosper, LP_NetPrincipalLoss > 0), cor.test(LP_NetPrincipalLoss, LoanOriginalAmount))
##
## Pearson's product-moment correlation
##
## data: LP_NetPrincipalLoss and LoanOriginalAmount
## t = 251.12, df = 16644, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8862690 0.8926146
## sample estimates:
## cor
## 0.8894846
WHOA! 0.8895%! That is a very strong positive correlation between loss on the principal and the principal loan amount. Plot says…! (with a linear model overplotted, as well as a 1-1 line for reference)
Now that’s interesting. It seems that in many cases, especially for the smaller loans, when there’s a loss it’s on most of the loan if not all of the loan.
Let’s plot the median Net Principal Loss vs Loan Principal here. For clarity, we should probably bin up the loan principal.
There’s another dimension to this that I’m not looking at. Does Income Range have an effect on how much of the loan is lost?
No real effect here. Most of these losses are coming in for the $25 - 75k range. What about the rating of the loan? For reference, AA is the lowest risk loan, and HR is highest-risk.
Interestingly enough, the highest-risk loans aren’t dominating this landscape of loan principal loss. At the low end? Sure there’s a bunch. But most of the loans lost seem to be mid-range in risk. Another dimension we haven’t looked at yet is the actual loss vs. the estimated loss. Let’s check that out
There’s effectively no correlation. cor.test will confirm
with(subset(prosper, (LP_NetPrincipalLoss > 0)), cor.test(EstimatedLoss, LP_NetPrincipalLoss/LoanOriginalAmount))
##
## Pearson's product-moment correlation
##
## data: EstimatedLoss and LP_NetPrincipalLoss/LoanOriginalAmount
## t = 6.6659, df = 6173, p-value = 2.859e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05972077 0.10925006
## sample estimates:
## cor
## 0.08453763
0.08453763 is basically nothing when only considering loans that have had losses.
with(prosper, cor.test(EstimatedLoss, LP_NetPrincipalLoss/LoanOriginalAmount))
##
## Pearson's product-moment correlation
##
## data: EstimatedLoss and LP_NetPrincipalLoss/LoanOriginalAmount
## t = 62.647, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2038176 0.2166797
## sample estimates:
## cor
## 0.2102577
0.2102577 with all loans, though still not well correlated when the data is plotted.
Let’s take these questions in a different direction and think from the point of view of the vendor. Let’s first ask, how much have lenders made on loans given? Is there some correlation with loan term? How about over time?
What this tells me is that no matter what, lenders have almost always made money on a loan. It’s also effectively telling me that the lender yield rate doesn’t depend much on time. Additionally, while the vast majority of loans are for 36 months, terms don’t seem to matter as much in whether or not there will be appreciable yield (although short term loans have a yield cap below 0.3). Interesting how lending effectively came to a stop between the end of 2008 and mid 2009. The financial crisis really shut things down.
I’m curious about these parameters involving “investments made by friends”. Let’s do histograms and time-series stuff.
Largely fruitless. Except for that last one. Let’s look at that in some greater detail.
Hmm. I have an idea of how this plays out with respect to loan amount but I want to see it for myself. First, I need to bin up some data.
prosper$LoanOriginalAmount.buckets <- cut(prosper$LoanOriginalAmount,
quantile(prosper$LoanOriginalAmount,
probs=c(0, .2, .4, .6, .8, 1.)),
labels = c("$1,000-3,000", "$3,000-5,000",
"$5,000-8,500", "$8,500-15,000",
"$15,000-35,000"))
Now revisit those investors
I don’t really…see this one going anywhere. I’ve stepped through a number of interesting variables though.
Another thing, let’s look at LenderYield with respect to the borrower rate. A simple relationship.
And it comes out as expected. The higher the borrower rate, the higher the lender yield.
What if we split this relationship up on different variables?
What else influences lender yield rate?
Hmm, now that’s interesting. There’s a slightly higher median yield rate from borrowers that DON’T own homes. I wonder why that is? Does term length matter?
No matter the term length, those who don’t own homes provide a larger Lender Yield Rate! What about income range?
Now that is interesting. Lender yield tends to be higher for borrowers that are not employed. And specifically, it bucks the trend that non-homeowning borrowers provide a generally higher yield than homeowning borrowers.
Let’s check on one more thing that just came to mind: Debt/Income ratio vs avg credit rating, colored by loan amount
Not entirely surprising but still interesting to see, the highest loans are given to people with better credit scores, while lower loans are available for most folks. What was more interesting to see was that the $5,000 - $15,000 range of loans was available all the way down to a credit score of 600. That’s a lot lower than I had expected, so I’ve learned something there. Also interesting to see the few cases where high debt/income ratios were still given loans. They were few and far between. Note that any debt/income ratio larger than 10 is shown as 10.01. So the structure for really high debt/income ratios is pretty much lost.
I’m curious about the breakdown by the original loan amount buckets, so let’s do that plot and then move on.
Apparently, keeping your debt-to-income ratio beneath 0.4 and a credit score > 620 is a great way to qualify for a loan of any type.
I want to try to assess the goodness of a borrower with respect to the requested loan amount. I’ll parameterize a good borrower by the net principal loss (LP_NetPrincipalLoss). Obviously, I’d want to only look at loans that have been closed, so only where ClosedDate_decimal is not NA. Here are some parameters that I think will be important to my model:
prosper_clean <- subset(prosper, (CreditScoreMean > 250) & (LP_NetPrincipalLoss > 0) &
!is.na(LP_NetPrincipalLoss) & !is.na(CreditScoreMean) &
!is.na(EmploymentStatusDuration) & !is.na(OpenCreditLines) &
!is.na(InquiriesLast6Months) & !is.na(AmountDelinquent) &
!is.na(AvailableBankcardCredit) & !is.na(DebtToIncomeRatio) &
!is.na(ClosedDate_decimal))
n_entries <- length(prosper_clean$Term)
all_indices <- seq(1:n_entries)
set.seed(42)
rand_samp <- sample(1:n_entries, as.integer(0.75 * n_entries)) #replace = False
prosper_train <- prosper_clean[rand_samp, ]
prosper_test <- prosper_clean[!(all_indices %in% rand_samp), ]
m1 <- lm(I(LP_NetPrincipalLoss) ~ I(CreditScoreMean),
data=prosper_train)
m2 <- update(m1, ~ . + LoanOriginalAmount)
m3 <- update(m2, ~ . + EmploymentStatusDuration)
m4 <- update(m3, ~ . + IsBorrowerHomeowner)
m5 <- update(m4, ~ . + OpenCreditLines)
m6 <- update(m5, ~ . + InquiriesLast6Months)
m7 <- update(m6, ~ . + CurrentDelinquencies)
m8 <- update(m7, ~ . + AmountDelinquent)
m9 <- update(m8, ~ . + AvailableBankcardCredit)
m10 <- update(m9, ~ . + StatedMonthlyIncome)
m11 <- update(m10, ~ . + DebtToIncomeRatio)
mtable(m11)
##
## Calls:
## m11: lm(formula = I(LP_NetPrincipalLoss) ~ I(CreditScoreMean) + LoanOriginalAmount +
## EmploymentStatusDuration + IsBorrowerHomeowner + OpenCreditLines +
## InquiriesLast6Months + CurrentDelinquencies + AmountDelinquent +
## AvailableBankcardCredit + StatedMonthlyIncome + DebtToIncomeRatio,
## data = prosper_train)
##
## ================================================
## (Intercept) 155.790
## (283.516)
## I(CreditScoreMean) -0.223
## (0.432)
## LoanOriginalAmount 0.733***
## (0.005)
## EmploymentStatusDuration 0.417
## (0.257)
## IsBorrowerHomeowner: True/False -116.942*
## (48.498)
## OpenCreditLines -6.423
## (4.843)
## InquiriesLast6Months 12.168*
## (6.074)
## CurrentDelinquencies 2.979
## (8.833)
## AmountDelinquent 0.005
## (0.003)
## AvailableBankcardCredit 0.003
## (0.002)
## StatedMonthlyIncome -0.003
## (0.007)
## DebtToIncomeRatio -56.851*
## (23.220)
## ------------------------------------------------
## R-squared 0.800
## adj. R-squared 0.800
## sigma 2029.329
## F 3318.337
## p 0.000
## Log-likelihood -82595.505
## Deviance 37603059672.875
## AIC 165217.011
## BIC 165309.580
## N 9143
## ================================================
The \(R^2\) coefficient is fairly high (0.800). I’ve learned from https://cran.r-project.org/web/packages/texreg/vignettes/v55i08.pdf that mtable throws asterisks onto coefficients of significance, with the number of asterisks corresponding to the significance level of that coefficient.
Perhaps we can refine our model. Let’s first take out the following coefficients that appear to matter least: CreditScoreMean, EmploymentStatusDuration, AmountDelinquent, CurrentDelinquencies, OpenCreditLines, StatedMonthlyIncome, AvailableBankcardCredit.
m1 <- lm(I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount),
data=prosper_train)
m2 <- update(m1, ~ . + LoanOriginalAmount)
m3 <- update(m2, ~ . + IsBorrowerHomeowner)
m4 <- update(m3, ~ . + InquiriesLast6Months)
m5 <- update(m4, ~ . + DebtToIncomeRatio)
mtable(m5)
##
## Calls:
## m5: lm(formula = I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount) +
## LoanOriginalAmount + IsBorrowerHomeowner + InquiriesLast6Months +
## DebtToIncomeRatio, data = prosper_train)
##
## ================================================
## (Intercept) 18.118
## (39.817)
## I(LoanOriginalAmount) 0.731***
## (0.004)
## IsBorrowerHomeowner: True/False -128.066**
## (43.727)
## InquiriesLast6Months 11.915*
## (5.740)
## DebtToIncomeRatio -58.216**
## (22.527)
## ------------------------------------------------
## R-squared 0.800
## adj. R-squared 0.800
## sigma 2029.719
## F 9119.292
## p 0.000
## Log-likelihood -82600.765
## Deviance 37646351022.611
## AIC 165213.531
## BIC 165256.255
## N 9143
## ================================================
The \(R^2\) value hasn’t changed a bit. Let’s test this out on our test set, find the residuals, and plot up some stuff.
estimated_losses <- predict(m5, newdata = prosper_test, interval="prediction",
level = 0.95)
prosper_test$residual_loss <- estimated_losses[, 1] - prosper_test$LP_NetPrincipalLoss
While it looks like I’m over-estimating a ton of entries, I think I might actually be underestimating a bunch and just not realize it.
summary(prosper_test$residual_loss/prosper_test$LP_NetPrincipalLoss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.7891 -0.2102 -0.0868 1.6860 0.2503 942.0000
On average I’m overestimating loss from a given borrower, though my median is saying I’m pretty close with my estimations. That mean is definitely affected by the few outliers, as it exists well outside of my 3rd quartile. Let’s histogram it up.
Last bits for this model. I want to check that these are consistent regardless of my random seed, so let’s do it again!
set.seed(79)
rand_samp <- sample(1:n_entries, as.integer(0.75 * n_entries)) #replace = False
prosper_train <- prosper_clean[rand_samp, ]
prosper_test <- prosper_clean[!(all_indices %in% rand_samp), ]
m1 <- lm(I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount),
data=prosper_train)
m2 <- update(m1, ~ . + LoanOriginalAmount)
m3 <- update(m2, ~ . + IsBorrowerHomeowner)
m4 <- update(m3, ~ . + InquiriesLast6Months)
m5 <- update(m4, ~ . + DebtToIncomeRatio)
mtable(m5)
##
## Calls:
## m5: lm(formula = I(LP_NetPrincipalLoss) ~ I(LoanOriginalAmount) +
## LoanOriginalAmount + IsBorrowerHomeowner + InquiriesLast6Months +
## DebtToIncomeRatio, data = prosper_train)
##
## ================================================
## (Intercept) 23.605
## (40.185)
## I(LoanOriginalAmount) 0.727***
## (0.004)
## IsBorrowerHomeowner: True/False -131.634**
## (44.084)
## InquiriesLast6Months 12.605*
## (5.762)
## DebtToIncomeRatio -43.710*
## (22.263)
## ------------------------------------------------
## R-squared 0.796
## adj. R-squared 0.796
## sigma 2047.828
## F 8910.741
## p 0.000
## Log-likelihood -82681.979
## Deviance 38321125224.589
## AIC 165375.959
## BIC 165418.683
## N 9143
## ================================================
Not bad! Not bad at all! Now, how do I interpret this in the context of the original idea of a borrower’s goodness?
The prosper data set contains 113,937 loans, each with 81 characteristics. Most characteristics are continuous, though a few are qualitative attributes:
The main features of the data set are LenderYield, LoanOriginalAmount, and LP_NetPrincipalLoss. Above I used other characteristics of loans to try to predict LP_NetPrincipalLoss on loans. It wasn’t half bad!
I’d initially thought that CreditScoreMean, EmploymentStatusDuration, IsBorrowerHomeowner, OpenCreditLines, InquiriesLast6Months, CurrentDelinquencies, AmountDelinquent, AvailableBankcardCredit, StatedMonthlyIncome, and DebtToIncomeRatio were all important when considering whether a loan would end up with a net loss, due to several facts and suggestions gleaned from investigations into personal credit. However, it ended up being that only the Loan Original Amount, the homeowner status, the last 6 months of credit inquiries, and the debt to income ratio were significant (as far as an inherently-flawed linear model was concerned).
Quite a few actually. Any variable with date information I broke apart into columns containing the day, the month, the year, and the date as a decimal of the year using strptime and strftime.
I created a variable called LoanOriginalAmount.buckets, containing 5 ranges of dollar amounts within which a loan could reside. It was helpful for looking at some distributions with respect to the loan amount range.
Lastly, I created a variable for the mean credit score, based on the upper and lower limits of credit scores for each borrower. Its purpose was so that I could pin down roughly which credit scores correspond to which loan/borrower characteristics, instead of having to interpret from a range.
The only unusual distributions that I took note of involved the debt-to-income ratios and the mean credit scores. It’s noted in the variable definitions file that the debt-to-income ratio metric caps at 10.01, regardless of how high the actual ratio might be. That produced slight anomalies in my distribution, and I ended up focusing mostly on debt-to-income ratios below 1.
The mean credit scores were odd in that the distribution went down to 9.5 even though credit scores shouldn’t really be below about 360. For my model analysis I ended up just using only credit scores that were above 250 and not “NA”.
Additionally, specifically for my linear model, I made sure every loan considered had a principal loss > 0, had some employment status duration (since it was a part of my initial variable set), open credit lines, inquiries in the last 6 months, some delinquent amount, some bankcard credit, some debt to income ratio, and some close date (to ensure that the loan was finished and not just late in payment). Basically I wanted to make sure that every variable considered for my model had a value that could be used to train the model.
One interesting thing is that the estimated loss rate on loans had effectively little correlation with the actual loss rate. This could be a couple things. One may be that I’m just misinterpreting the Estimated Loss The variable is described as “Estimated loss is the estimated principal loss on charge-offs.” I calculated the actual loss rate as LP_NetPrincipalLoss / LoanOriginalAmount, or in other words, the principal remaining uncollected after any recoveries divided by the original principal of the loan. It may not be a full 1-to-1 correlation between what Prosper refers to as a loss rate, but there should be some correlation if their estimated loss rate is based on real data. The correlation coefficient from with(prosper, cor.test(LP_NetPrincipalLoss/LoanOriginalAmount, EstimatedLoss)) produces a coefficient of 0.210. Even smaller when only considering loans that posted an actual loss (0.084).
On the other hand, the amount lost correlated pretty well with the loan principal, which is to some degree expected. The higher the loan, the more that can be lost. However, it’s interesting to note that for loans of higher principals, it happens more often that if there is a loss, then larger fractions of the loan are likely to be lost. This is illustrated pretty well in the following diagram:
Interestingly enough, the “mean” line in red says that no matter what the loan principal, on average there’s not much difference in the fraction lost (considering the scatter). However the density of points suggest differently.
Lastly, Loan Principal vs Mean Credit Score. Now it seems as if any credit score can get a small loan. However, to get the highest loans you need at least a credit score of 600. Oddly enough, borrowers with the highest credit scores (\(\ge800\)) don’t borrow the most. They borrow at about the same level of those with low credit scores (\(\le660\)).
I found the relationship between the debt-to-income ratio and mean credit score to be somewhat interesting. Notably, the debt-to-income ratio seems to reach its highest in the middle of the mean Credit Score distribution. This is more of personal interest than intrinsic interest, as I’d expected the higher debt-to-income ratios to come with the lowest credit scores. Upon further thought though it makes sense. Borrowers with bad credit would only be permitted to borrow if their debt-to-income ratio is low. It’d be silly for a bank to lend money to someone shown to be bad with credit (low credit score) who also is already under a mountain of debt. On the other end, those with high credit would rarely have a high debt-to-income ratio.
The other relationship between non-main features I found interesting was mean credit score vs the listing category. For categories 4 and 5 (Personal Loan and Student Use), the credit score median and range are lower than all others. I suppose that necessity allows for those with lower credit to borrow. In contrast, a frivolous expense like a boat has a higher median credit score, as well as a much higher 75th percentile. All others are more or less in the same ranges for median and quartiles. What this means to me as a potential borrower is that if I want to get loan money from a bank, a credit score of about 710 is a good place to start.
One last one was between the total credit lines in the past 7 years vs the stated monthly income. It makes sense when I think about it, but apparently the more money you make, the more credit lines you’re likely to have opened. In fact, making more money seems to imply a lower limit to the amount of credit lines open!
Objectively the strongest relationship that I saw for this data set was between the net principal loss and the loan principal amount. However, this isn’t really a fair assessment of strong relationships within this data set as these are not independent variables. Simply, the more money borrowed, the more money that can be lost when the loan isn’t paid. It would’ve been interesting to see if the loss RATE scaled with loan principal, but when I looked at that it showed an effectively flat distribution in loss rates across all loan principals.
The strongest relationship I found between seemingly unrelated components was the Total Credit Lines in the Past 7 Years vs Stated Monthly Income. The more money you make, the more credit you take.
Lender Yield is generally higher for higher-risk loans, depending of course on borrower APR. Flattened to 2D it’s easy to see that Lender Yield is higher for higher APR. That’s the point of having a higher APR.
A higher monthly income generally affords for higher loans, but being listed as employed (either as generally employed or full-time) shows a much wider field of potential in borrowed loans. Holding monthly income constant, more employment means you can borrow far larger loans, and self-employment isn’t quite as trusted.
Lender Yield vs homeowner status, split by income range. What surprised me were a couple things. The first was that unemployed homeowners provided a larger yield rate on their loans than unemployed people without homes. It’s surprising because it bucks the trend seen in every other income range. The other thing that was surprising was that lender yield rate was higher for those who made the LEAST money. I’d have expected that those who made more money would provide a higher yield rate, but perhaps it’s only that those who have little money to spare are given higher APRs as lenders don’t trust them as much as those with higher income.
Lender Yield vs borrower APR, split by credit grade. As mentioned above, lender yield is higher for higher-risk loans. However, the split on credit grade makes lenders almost seem predatory, making their larger yield (rates) on those with a shaky credit history. Coloring by the actual loan amount makes it seem a little less predatory, with higher-risk borrowers only acquiring generally smaller loans, however these are likely loans out of necessity. It makes me wonder what sorts of holes these borrowers are digging for themselves by taking out loans with shaky credit.
I created a linear model to try and predict the amount that may be lost on a closed loan. The model was trained on 75% of randomly-selected data that posted a loss (without replacement) and tested on the remaining 25% posting a loss. I wanted to use this estimated loss as a proxy for borrower reliability.
One strength of my model is that, based on nothing else besides the magnitudes of coefficients output by my model, it can predict principal loss pretty well, accounting for 80% of the variation in the original loan amount. A large fraction of my residuals are near zero and a fair number overestimate the loss, which in a more realistic setting would help the lending party be more conservative with who they lend to. Another strength is that the model is fairly consistent, regardless of which random 75% of my data is selected to train it.
A major weakness however is that this model is not at all based on any solid financial theory. I can’t use this model to back up any ideas about borrower behavior, it’s just how the numbers work. Another weakness is that the model only considers loans that HAVE a loss, in addition to only considering loans that have what I considered valid measurements in the fields I was using as variables. In reality only represents about 10.7% of the total data set, so this isn’t really a robust model and a potential lender would be remiss in using it to assess the actual reliability of a potential borrower. I would call it the extremely-pessimistic estimation of loss on a given loan.
The distribution of employment statuses across borrowers, split by the term length of the loans they borrow. Because the y-xis is logarithmic, it’s interesting to see just how many loans are long-term. It could be that fully-employed (Employed, Full-time, and Self-employed) folks take out short term loans just to get over a hump or make a new purchase that they know they can cover in the near term with their income.
Box plots of the lender yield vs homeowner status, split by income range. The boxes show the 3rd/1st quartile plus/minus 1.5x the interquartile range of rates in the vertical lines (top/bottom). Additionally, it shows the first, second (median, thick), and third quartiles in the horizontal lines. The income ranges are based on the income of the borrower at the moment the loan was issued.
What we’re seeing here is that for every income range except where income wasn’t reported (the “NA” set), the lender yield rate is higher for those that don’t own homes. What’s more, the less the borrower makes the higher the rate of return for the lender (caused no doubt by a higher borrower APR). Interesting however that the trend of lender yield rates is so similar across 5 very different income ranges.
Scatter plots of the debt-to-income ratio of the borrower vs mean credit score, split by the original amount of each loan. Overplotted in green is the average debt-to-income ratio for each bin of average credit score, with a bin width of 20. The axis for debt-to-income ratios is heavily truncated; the maximum is 10.01 even though some debt-to-income ratios are actually higher, and I’m focusing on the bulk of the distribution, where the ratio is less than about 1.2.
It’s interesting to note that while the biggest loans are largely only available to those with credit scores > 600, loans of all sizes are issued for the full range of credit scores. What’s universal between loan sets is that the debt-to-income ratios of borrowers is chiefly below 0.7, with the highest loans ($15,000+) falling mostly below 0.4. So, while credit scores may take on a wide range, one’s debt compared to their income will be a large determinant in how much money Prosper will allow one to borrow. It’s also interesting to see that having the highest credit score doesn’t necessarily translate to taking out tons of credit debt. We see this in te distribution of debt-to-income ratios, which stay low at the higher credit scores; they probably maintain their credit by keeping low debt to begin with. Contrast this with the lower credit scores, where the debt to income ratios at that end are likely low because that’s the only way that they could be approved for a loan.
Fix these later Basic Description: Well-sized data set with many characteristics, though few of which are useful for studying borrower behavior. The set begins in 2005 and follows borrowers from Prosper through early 2014.
Difficulties: Finding meaningful characteristics to explore without dumping a ton of time into an N-by-N plot with ggpairs. Even using just 6 categories was a slog for my machine.
Successes: Time series data!
Ideas and Improvements: Ethnicity data beyond simply the state of borrower. Maybe also education level.